A Lightweight End-to-End Speech Recognition System on Embedded Devices
Abstract
In industry, automatic speech recognition has come to be a competitive feature for embedded products with poor hardware resources. In this work, we propose a tiny end-to-end speech recognition model that is lightweight and easily deployable on edge platforms. First, instead of sophisticated network structures, such as recurrent neural networks, transformers, etc., the model mainly uses convolutional neural networks as its backbone. This ensures that our model is supported by most software development kits for embedded devices. Second, we adopt the basic unit of MobileNet-v3, which performs well in computer vision tasks, and integrate the features of the hidden layer at different scales, thus compressing the number of parameters of the model to less than 1 M while achieving an accuracy greater than that of some traditional models. Third, in order to further reduce the CPU computation, we directly extract acoustic representations from 1-dimensional speech waveforms and use a self-supervised learning approach to encourage the convergence of the model. Finally, to solve some problems where hardware resources are relatively weak, we use a prefix beam search decoder to dynamically extend the search path with an optimized pruning strategy and an additional initialism language model that captures the probability of between-words in advance and thus avoids premature pruning of correct words. In our experiments, according to a number of evaluation categories, our end-to-end model outperformed several tiny speech recognition models used for embedded devices in related work.
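To make the parameter-compression claim concrete, the following is a minimal sketch of why MobileNet-style units are small: they are built around depthwise separable convolutions, which replace one dense convolution with a per-channel (depthwise) filter followed by a 1x1 (pointwise) mixing layer. This is not the authors' architecture. The channel counts and kernel size below are hypothetical values chosen only for illustration, and the helper function names are made up for this example.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a standard 1-D convolution (bias terms ignored)."""
    return c_in * c_out * kernel


def depthwise_separable_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, not taken from the paper.
C_IN, C_OUT, KERNEL = 128, 128, 9

standard = conv1d_params(C_IN, C_OUT, KERNEL)
separable = depthwise_separable_params(C_IN, C_OUT, KERNEL)
ratio = standard / separable

print(f"standard conv:  {standard} weights")
print(f"separable conv: {separable} weights")
print(f"reduction:      {ratio:.1f}x")
```

For this assumed layer the separable form needs roughly one eighth of the weights, which is the kind of saving that makes a sub-1 M parameter budget plausible for a convolutional backbone.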
Related resources
Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC [6] while bei...
End-to-end Audiovisual Speech Recognition
Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the ...
End-to-End Speech Recognition Models
For the past few decades, the bane of Automatic Speech Recognition (ASR) systems have been phonemes and Hidden Markov Models (HMMs). HMMs assume conditional independence between observations, and the reliance on explicit phonetic representations requires expensive handcrafted pronunciation dictionaries. Learning is often via detached proxy problems, and there especially exists a disconnect betw...
Multichannel End-to-end Speech Recognition
The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend ...
Towards End-to-End Speech Recognition
Standard automatic speech recognition (ASR) systems follow a divide and conquer approach to convert speech into text. Alternately, the end goal is achieved by a combination of sub-tasks, namely, feature extraction, acoustic modeling and sequence decoding, which are optimized in an independent manner. More recently, in the machine learning community deep learning approaches have emerged which al...
Journal
Journal title: IEICE Transactions on Information and Systems
Year: 2023
ISSN: 0916-8532, 1745-1361
DOI: https://doi.org/10.1587/transinf.2022edp7221